Search for: All records

Editors contains: "Birol, Inanc"

Multidimensional scaling improves distance-based clustering for microbiome data

https://doi.org/10.1093/bioinformatics/btaf042

Chen, Guanhua; Wang, Xinyue; Sun, Qiang; Tang, Zheng-Zheng (February 2025, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation: Clustering patients into subgroups based on their microbial compositions can greatly enhance our understanding of the role of microbes in human health and disease etiology. Distance-based clustering methods, such as partitioning around medoids (PAM), are popular due to their computational efficiency and absence of distributional assumptions. However, the performance of these methods can be suboptimal when true cluster memberships are driven by differences in the abundance of only a few microbes, a situation known as the sparse signal scenario. Results: We demonstrate that classical multidimensional scaling (MDS), a widely used dimensionality reduction technique, effectively denoises microbiome data and enhances the clustering performance of distance-based methods. We propose a two-step procedure that first applies MDS to project high-dimensional microbiome data into a low-dimensional space, followed by distance-based clustering using the low-dimensional data. Our extensive simulations demonstrate that our procedure offers superior performance compared to directly conducting distance-based clustering under the sparse signal scenario. The advantage of our procedure is further showcased in several real data applications. Availability and implementation: The R package MDSMClust is available at https://github.com/wxy929/MDS-project.
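The two-step procedure described in this abstract (classical MDS, then distance-based clustering on the low-dimensional coordinates) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MDSMClust implementation: all function names are hypothetical, the eigenvectors are found by plain power iteration (which assumes the leading eigenvalues of the centered matrix are non-negative), and the clustering step is a simplified alternating k-medoids rather than the full PAM swap search.

```python
import math
import random


def distance_matrix(points):
    """Pairwise Euclidean distances (a stand-in for any microbiome distance)."""
    return [[math.dist(p, q) for q in points] for p in points]


def double_center(dist):
    """Return B = -1/2 * J * D^2 * J for a symmetric distance matrix D."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def classical_mds(dist, n_dims, iters=500):
    """Embed samples in n_dims coordinates using power iteration with deflation."""
    n = len(dist)
    b = double_center(dist)
    rng = random.Random(0)
    coords = [[0.0] * n_dims for _ in range(n)]
    for d in range(n_dims):
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # eigenvalue estimate
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]  # deflate the found component
    return coords


def k_medoids(dist, k, iters=50):
    """Simplified alternating k-medoids with farthest-point initialization."""
    n = len(dist)
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(meds):
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(iters):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members),
                               default=medoids[c]))
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids)


def mds_then_cluster(dist, n_dims, k):
    """Step 1: MDS embedding. Step 2: cluster on distances in the embedding."""
    coords = classical_mds(dist, n_dims)
    return k_medoids(distance_matrix(coords), k)
```

In practice the input would be a distance matrix computed from microbial abundance profiles, and the number of retained dimensions would be chosen by the analyst; neither choice is specified in the abstract.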
Free, publicly-accessible full text available February 1, 2026
Accurate flux predictions using tissue-specific gene expression in plant metabolic modeling

https://doi.org/10.1093/bioinformatics/btad186

Kaste, Joshua A; Shachar-Hill, Yair (May 2023, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation The accurate prediction of complex phenotypes such as metabolic fluxes in living systems is a grand challenge for systems biology and central to efficiently identifying biotechnological interventions that can address pressing industrial needs. The application of gene expression data to improve the accuracy of metabolic flux predictions using mechanistic modeling methods such as flux balance analysis (FBA) has not been previously demonstrated in multi-tissue systems, despite their biotechnological importance. We hypothesized that a method for generating metabolic flux predictions informed by relative expression levels between tissues would improve prediction accuracy. Results Relative gene expression levels derived from multiple transcriptomic and proteomic datasets were integrated into FBA predictions of a multi-tissue, diel model of Arabidopsis thaliana’s central metabolism. This integration dramatically improved the agreement of flux predictions with experimentally based flux maps from 13C metabolic flux analysis (MFA) compared with a standard parsimonious FBA approach. Disagreement between FBA predictions and MFA flux maps was measured using weighted averaged percent error values, and for parsimonious FBA this was 169%–180% for high light conditions and 94%–103% for low light conditions, depending on the gene expression dataset used. This fell to 10%–13% and 9%–11% upon incorporating expression data into the modeling process, which also substantially altered the predicted carbon and energy economy of the plant. Availability and implementation Code and data generated as part of this study are available from https://github.com/Gibberella/ArabidopsisGeneExpressionWeights.
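The abstract reports disagreement as a weighted average percent error but does not define the metric. The sketch below shows one plausible form, in which each flux's absolute percent error is averaged using caller-supplied weights; the formula, the function name, and the choice of weights are all assumptions made for illustration, not the authors' definition.

```python
def weighted_average_percent_error(predicted, measured, weights):
    """Weighted mean of |predicted - measured| / |measured|, in percent.

    Assumed form only: the source abstract does not specify how the
    weights are chosen or how the error is normalized.
    """
    if not (len(predicted) == len(measured) == len(weights)):
        raise ValueError("inputs must have the same length")
    numerator = 0.0
    denominator = 0.0
    for p, m, w in zip(predicted, measured, weights):
        if m == 0:
            raise ValueError("percent error is undefined for a zero measured flux")
        numerator += w * abs(p - m) / abs(m)
        denominator += w
    if denominator == 0:
        raise ValueError("weights must not sum to zero")
    return 100.0 * numerator / denominator
```

With equal weights, one exact prediction and one prediction that is double the measured flux give an error of 50%.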
Full Text Available
NanoSNP: a progressive and haplotype-aware SNP caller on low-coverage nanopore sequencing data

https://doi.org/10.1093/bioinformatics/btac824

Huang, Neng; Xu, Minghua; Nie, Fan; Ni, Peng; Xiao, Chuan-Le; Luo, Feng; Wang, Jianxin (January 2023, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation Oxford Nanopore sequencing has great potential and advantages in population-scale studies. Due to the cost of sequencing, the depth of whole-genome sequencing per individual sample must be small. However, the existing single nucleotide polymorphism (SNP) callers are aimed at high-coverage Nanopore sequencing reads. Detecting SNP variants in low-coverage Nanopore sequencing data is still a challenging problem. Results We developed a novel deep learning-based SNP calling method, NanoSNP, to identify SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. In this method, we design a multi-step, multi-scale and haplotype-aware SNP detection pipeline. First, the pileup model in NanoSNP utilizes the naive pileup feature to predict a subset of SNP sites with a bidirectional long short-term memory (Bi-LSTM) network. These SNP sites are phased and used to divide the low-coverage Nanopore reads into different haplotypes. Finally, the long-range haplotype feature and short-range pileup feature are extracted from each haplotype. The haplotype model combines the two features and predicts the genotype for the candidate site using a Bi-LSTM network. To evaluate the performance of NanoSNP, we compared NanoSNP with Clair, Clair3, Pepper-DeepVariant and NanoCaller on low-coverage (∼16×) Nanopore sequencing reads. We also performed cross-genome testing on six human genomes, HG002–HG007. Comprehensive experiments demonstrate that NanoSNP outperforms Clair, Pepper-DeepVariant and NanoCaller in identifying SNPs on low-coverage Nanopore sequencing data, including the difficult-to-map regions and major histocompatibility complex regions in the human genome. NanoSNP is comparable to Clair3 when the coverage exceeds 16×. Availability and implementation https://github.com/huangnengCSU/NanoSNP.git. Supplementary information Supplementary data are available at Bioinformatics online.
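To make the idea of a pileup feature concrete, the sketch below counts the bases observed at each reference position and flags positions where non-reference bases are frequent. It is only an illustration of what a pileup is: NanoSNP feeds pileup features to a Bi-LSTM network, whereas the fixed thresholds here (`min_depth`, `min_alt_fraction`) and the gapless read representation are simplifying assumptions.

```python
from collections import Counter


def pileup_counts(reads, ref_len):
    """Count bases per reference position.

    Each read is (start, sequence) and is assumed to align without gaps.
    """
    counts = [Counter() for _ in range(ref_len)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_len:
                counts[pos][base] += 1
    return counts


def candidate_sites(reference, reads, min_depth=3, min_alt_fraction=0.3):
    """Return positions with enough coverage and enough non-reference bases."""
    sites = []
    for pos, column in enumerate(pileup_counts(reads, len(reference))):
        depth = sum(column.values())
        if depth < min_depth:
            continue  # too few reads to judge this position
        alt = depth - column[reference[pos]]
        if alt / depth >= min_alt_fraction:
            sites.append(pos)
    return sites
```

At low coverage many positions fall below any reasonable depth threshold, which is one way to see why a simple count-based rule struggles in the setting the abstract targets.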
Full Text Available
CellMeSH: probabilistic cell-type identification using indexed literature

https://doi.org/10.1093/bioinformatics/btab834

Mao, Shunfu; Zhang, Yue; Seelig, Georg; Kannan, Sreeram (February 2022, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge. Results Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches. Availability and implementation Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh. Supplementary information Supplementary data are available at Bioinformatics online.
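A probabilistic query against a gene–cell-type association database can be illustrated with a smoothed log-likelihood score. The sketch below is not CellMeSH's model or API: the scoring rule (a multinomial likelihood with a pseudocount), the function name, and the toy association counts are all assumptions chosen to show why smoothing keeps a noisy query from being derailed by a single unexpected gene.

```python
import math


def rank_cell_types(query_genes, database, pseudocount=1.0):
    """Rank cell types by the smoothed log-likelihood of the query genes.

    database maps cell type -> {gene: association count}; counts here
    stand in for literature co-occurrence counts.
    """
    vocabulary = {gene for counts in database.values() for gene in counts}
    vocabulary.update(query_genes)
    scores = {}
    for cell_type, counts in database.items():
        total = sum(counts.values()) + pseudocount * len(vocabulary)
        scores[cell_type] = sum(
            math.log((counts.get(gene, 0) + pseudocount) / total)
            for gene in query_genes
        )
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    toy_db = {
        "T cell": {"CD3E": 10, "CD4": 5},
        "B cell": {"CD19": 8, "MS4A1": 7},
    }
    print(rank_cell_types(["CD19", "CD3E", "MS4A1"], toy_db))
```

Because unseen genes receive a small non-zero probability instead of zero, a query containing one off-target gene still ranks the best-supported cell type first.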
Full Text Available
PPIT: an R package for inferring microbial taxonomy from nifH sequences

https://doi.org/10.1093/bioinformatics/btab100

Kapili, Bennett J; Dekas, Anne E (February 2021, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation Linking microbial community members to their ecological functions is a central goal of environmental microbiology. When assigned taxonomy, amplicon sequences of metabolic marker genes can suggest such links, thereby offering an overview of the phylogenetic structure underpinning particular ecosystem functions. However, inferring microbial taxonomy from metabolic marker gene sequences remains a challenge, particularly for the frequently sequenced nitrogen fixation marker gene, nitrogenase reductase (nifH). Horizontal gene transfer in recent nifH evolutionary history can confound taxonomic inferences drawn from the pairwise identity methods used in existing software. Other methods for inferring taxonomy are not standardized and require manual inspection that is difficult to scale. Results We present Phylogenetic Placement for Inferring Taxonomy (PPIT), an R package that infers microbial taxonomy from nifH amplicons using both phylogenetic and sequence identity approaches. After users place query sequences on a reference nifH gene tree provided by PPIT (n = 6317 full-length nifH sequences), PPIT searches the phylogenetic neighborhood of each query sequence and attempts to infer microbial taxonomy. An inference is drawn only if references in the phylogenetic neighborhood (1) are taxonomically consistent and (2) share sufficient pairwise identity with the query, thereby avoiding erroneous inferences due to known horizontal gene transfer events. We find that PPIT returns a higher proportion of correct taxonomic inferences than BLAST-based approaches at the cost of fewer total inferences. We demonstrate PPIT on deep-sea sediment and find that Deltaproteobacteria are the most abundant potential diazotrophs. Using this dataset we show that emending PPIT inferences based on visual inspection of query sequence placement can achieve taxonomic inferences for nearly all sequences in a query set. We additionally discuss how users can apply PPIT to the analysis of other marker genes. Availability PPIT is freely available to non-commercial users at https://github.com/bkapili/ppit. Installation includes a vignette that demonstrates package use and reproduces the nifH amplicon analysis discussed here. The raw nifH amplicon sequence data have been deposited in the GenBank, EMBL, and DDBJ databases under BioProject number PRJEB37167. Supplementary information Supplementary data are available at Bioinformatics online.
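The two-condition decision rule in this abstract (neighbors must agree on taxonomy and must be sufficiently similar to the query) can be expressed as a small function. This is a schematic sketch, not PPIT's code or interface: the function name, the tuple representation of neighbors, and the `min_identity` threshold are hypothetical, and the abstract does not state what identity cutoffs the package uses.

```python
def infer_taxon(neighbors, min_identity):
    """Return a taxon only when both conditions from the abstract hold.

    neighbors: list of (taxon_label, pairwise_identity_to_query) for the
    reference sequences in the query's phylogenetic neighborhood.
    Returns None when no inference should be drawn.
    """
    if not neighbors:
        return None
    taxa = {taxon for taxon, _ in neighbors}
    if len(taxa) != 1:
        return None  # condition 1 fails: neighborhood is taxonomically mixed
    if any(identity < min_identity for _, identity in neighbors):
        return None  # condition 2 fails: a reference is too divergent
    return taxa.pop()
```

Returning `None` rather than a best guess mirrors the trade-off the abstract reports: fewer total inferences in exchange for a higher proportion of correct ones.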
Full Text Available
Compression of quantification uncertainty for scRNA-seq counts

https://doi.org/10.1093/bioinformatics/btab001

Van Buren, Scott; Sarkar, Hirak; Srivastava, Avi; Rashid, Naim U; Patro, Rob; Love, Michael I (January 2021, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of ‘inferential replicates’, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increase computation time and memory requirements. Results We demonstrate that storing only the mean and variance from a set of inferential replicates (‘compression’) is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate ‘pseudo-inferential’ replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset. Availability and implementation makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper’s GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode. Supplementary information Supplementary data are available at Bioinformatics online.
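The compression idea (keep only the mean and variance of the inferential replicates, then regenerate pseudo-inferential replicates from a negative binomial distribution) can be sketched as follows. This is not the fishpond `makeInfReps` implementation: the function names are hypothetical, the negative binomial is drawn as a gamma-Poisson mixture matched to the stored moments, and the fallback to a Poisson draw when the variance does not exceed the mean is an assumption of this sketch.

```python
import math
import random
import statistics


def compress(replicates):
    """Reduce a gene's inferential replicates to (mean, sample variance)."""
    return float(statistics.mean(replicates)), float(statistics.variance(replicates))


def sample_poisson(lam, rng):
    """Knuth's multiplication method, applied in chunks to avoid exp underflow."""
    total = 0
    while lam > 0:
        chunk = min(lam, 500.0)
        limit = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        total += k
        lam -= chunk
    return total


def pseudo_replicates(mean, variance, n, seed=0):
    """Draw n negative binomial counts with the given mean and variance."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if mean <= 0:
            draws.append(0)
        elif variance > mean:
            shape = mean ** 2 / (variance - mean)   # NB size parameter
            scale = (variance - mean) / mean        # gamma scale
            draws.append(sample_poisson(rng.gammavariate(shape, scale), rng))
        else:
            draws.append(sample_poisson(mean, rng))  # no overdispersion to model
    return draws
```

Storing two numbers per gene and cell instead of the full replicate set is what yields the storage and memory reductions the abstract reports; the replicates are regenerated only when an analysis needs them.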
Full Text Available
Bipartite Tight Spectral Clustering (BiTSC) Algorithm for Identifying Conserved Gene Co-clusters in Two Species

https://doi.org/10.1093/bioinformatics/btaa741

Sun, Yidan Eden; Zhou, Heather J; Li, Jingyi Jessica (August 2020, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Motivation Gene clustering is a widely used technique that has enabled computational prediction of unknown gene functions within a species. However, it remains a challenge to refine gene function prediction by leveraging evolutionarily conserved genes in another species. This challenge calls for a new computational algorithm to identify gene co-clusters in two species, so that genes in each co-cluster exhibit similar expression levels in each species and strong conservation between the species. Results Here we develop the bipartite tight spectral clustering (BiTSC) algorithm, which identifies gene co-clusters in two species based on gene orthology information and gene expression data. BiTSC implements a novel formulation that encodes gene orthology as a bipartite network and gene expression data as node covariates. This formulation allows BiTSC to adopt and combine the advantages of multiple unsupervised learning techniques: kernel enhancement, bipartite spectral clustering, consensus clustering, tight clustering, and hierarchical clustering. As a result, BiTSC is a flexible and robust algorithm capable of identifying informative gene co-clusters without forcing all genes into co-clusters. Another advantage of BiTSC is that it does not rely on any distributional assumptions. Beyond cross-species gene co-clustering, BiTSC also has wide applications as a general algorithm for identifying tight node co-clusters in any bipartite network with node covariates. We demonstrate the accuracy and robustness of BiTSC through comprehensive simulation studies. In a real data example, we use BiTSC to identify conserved gene co-clusters of D. melanogaster and C. elegans, and we perform a series of downstream analyses to both validate BiTSC and verify the biological significance of the identified co-clusters. Availability and implementation The Python package BiTSC is open-access and available at https://github.com/edensunyidan/BiTSC.
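The encoding of orthology as a bipartite network can be illustrated with a biadjacency matrix whose rows are genes from one species and whose columns are genes from the other. The sketch below is not taken from the BiTSC package: the function names and gene identifiers are hypothetical, and the degree normalization shown is the one commonly used as input to bipartite spectral clustering in general, assumed here for illustration.

```python
import math


def biadjacency(genes_a, genes_b, ortholog_pairs):
    """Rows index species-A genes, columns index species-B genes.

    An entry is 1.0 when the two genes are listed as orthologs.
    """
    index_a = {gene: i for i, gene in enumerate(genes_a)}
    index_b = {gene: j for j, gene in enumerate(genes_b)}
    matrix = [[0.0] * len(genes_b) for _ in genes_a]
    for gene_a, gene_b in ortholog_pairs:
        matrix[index_a[gene_a]][index_b[gene_b]] = 1.0
    return matrix


def degree_normalize(matrix):
    """Scale each entry by 1 / sqrt(row degree * column degree)."""
    row_deg = [sum(row) for row in matrix]
    col_deg = [sum(col) for col in zip(*matrix)]
    return [
        [
            value / math.sqrt(row_deg[i] * col_deg[j]) if value else 0.0
            for j, value in enumerate(row)
        ]
        for i, row in enumerate(matrix)
    ]
```

Gene expression profiles would then be attached to the row and column nodes as covariates; how BiTSC combines them with the network is described in the paper, not reproduced here.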
Full Text Available
Ribbon: intuitive visualization for complex genomic variation

https://doi.org/10.1093/bioinformatics/btaa680

Nattestad, Maria; Aboukhalil, Robert; Chin, Chen-Shan; Schatz, Michael C (August 2020, Bioinformatics)
Birol, Inanc (Ed.)
Abstract Summary Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them. Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons. Availability and implementation Ribbon is freely available online at http://genomeribbon.com/ and is open-source at https://github.com/marianattestad/ribbon. Supplementary information Supplementary data are available at Bioinformatics online.
Full Text Available